A major challenge in the use of traditional single-training-run \FDO\
is the selection of a profiling data input that is representative of
the execution of the program throughout its lifetime.  For large and
complex programs dealing with many use cases, and used by a multitude
of users, assembling an appropriately representative workload may be a
difficult task.  Picking a solitary training run to represent such a
space is far more challenging, or potentially impossible, if use-cases
are mutually-exclusive.  While benchmark programs can be modified to
combine such use-cases into a single run,
this approach is obviously inapplicable for real programs.  Moreover,
user workloads are prone to change over time.  Ensuring stable
performance across all inputs in today's workload prevents performance
degradation due to changes in the relative importance of workload
components.

Berube developed the {\em Combined Profiling} (\CP) statistical modelling technique that produces a {\it Combined Profile} (\CProf)
from a collection of traditional single-run profiles, thus
facilitating the collection and representation of profile information
over multiple runs~\cite{BerubePhD}. The use of many profiling runs, in turn, eases the
burden of training-workload selection and mitigates the potential for
performance degradation.  There is no need to select a single input
for training because data from any number of training runs can be
merged into a combined profile.  More importantly, \CP\ preserves
variations in execution behavior across inputs.  The distribution of
behaviors can be queried and analyzed by the compiler when making
code-transformation decisions.  Modestly profitable transformations
can be performed with confidence when they are beneficial to the
entire workload. On the other hand, transformations expected to be
highly beneficial on average can be suppressed when performance
degradation would be incurred on some members of the workload.

Combining profiles is a three-step process \cite{BerubeISPASS12}:
\begin{enumerate}
\item Collect raw profiles via traditional profiling.
\item Apply {\em Hierarchical Normalization} (\HN) to each raw profile. 
\item Apply \CP\ to the normalized profiles to create the combined profile.
\end{enumerate}
\CP\ provides a data representation for profile information, but does
not specify the semantics of the information stored in the combined
profile~\cite{BerubeICPE11}.  Naive combination of raw profiles, such as simple sums or arithmetic averages, can be very misleading.

\subsection{Hierarchical Normalization}
\label{cp:hn}

In \CP\ a monitor is used to measure an event during the execution of a program. For example a monitor may count the number of times that a specific edge in the control-flow graph is traversed. There is a problem when pairs of measurements are taken under different
conditions.  Thus, when
combining these measurements, all values recorded for a monitor must
be normalized relative to a common fixed reference.  {\em Hierarchical
  normalization} (\HN)  is designed for
\CP\ and decomposes a \CFG\ into a hierarchy
of dominating regions~\cite{BerubePhD}.

\subsection{Denormalization}
\label{cp:denorm}

The properties of a monitor $R_a$ can only be directly compared to
those of a monitor $R_b$ when $dom(a) = dom(b)$.  However, more
generalized reasoning about $R_a$ may be needed when considering code
transformations.  Similarly, when code is moved by a transformation,
its profile information must be correctly updated. {\it
  Denormalization} reverses the effects of hierarchical normalization
to lift monitors out of nested domination regions by marginalizing-out
the distribution of the dominators above which they are lifted.
Denormalization is a heuristic method rather than an exact statistical
inference because it assumes statistical independence between monitors.

\subsection{Queries}
\label{cp:queries}

In an Ahead-of-Time (AOT) compiler, profiles are used to predict program behavior.
Thus, raw profiles are statistical models that use a single sample to
answer exactly one question: {\em ``What is the expected frequency of
  X?''}  where X is an edge or path in a \CFG\ or a Call Graph (\CG).
A \CP\ is a much richer statistical model that can answer a wide range
of queries about the measured program behavior.  The implementation of
\CP\ used in this work provides several queries as
methods of a monitor's histogram. For each monitor, this queries include the maximum and non-zero minimum monitor value observed, the mean of the observed values, the standard deviation of the values, an estimation of the cummulative distribution function (CDF) of the values, a quantile for $0 \leq q \leq 1$, which is the poind where the CDF equals $q$, and others~\cite{BerubePhD}. 

This reach set of queries enables a compiler designer to obtem precise information about the nuances of the behavior of a monitor. Therefore,
\CP\ enables the accurate assessment of the
potential performance impact of transformations informed by
variable-behavior monitors in a variety of ways, and with adjustable
confidence in the result. Concrete examples of this kind of analysis
are provided by the implementation of an \FDO\ inliner using
\CP\ described in \cite{BerubePhD}.

The empirical-distribution methodology of \CP\ is orthogonal to the
techniques used to collect raw profiles.  \CP\ is applicable whenever
multiple profile instances are collected, including intra-run
phase-based profiles, profiles collected from hardware
performance-counter, and sampled profiles.  The main issue when
combining profiles is how normalization should be done in order to
preserve program-behavior characteristics.
